Note: This is older data and the meaningful identifiers have been removed from the compounds (Mass, LC Retention Time). All compounds are only referenced by an arbitrary number (although the compounds were sorted by mass from smallest to largest before being given a number).
This is a mass spectrometry metabolomics experiment based on a project by a post-doc in my lab, who is in the process of creating a database of oxidation-sensitive molecules and the products of their oxidation.
The motivation behind this project is that there is evidence that a variety of diseases may be caused by oxidative stress. Oxidative stress occurs when there is a large quantity of oxidizing molecules present in a cell, tissue, or organism, such that they cannot be neutralized by the organism’s antioxidant defenses, which come in a variety of forms and can be produced by the organism itself or obtained from the diet. When oxidative stress is too high, the oxidizing molecules go rogue, chemically speaking, and can damage DNA, proteins, and other molecules within the cell. This type of cellular damage has been linked to diseases such as cancer and heart disease. Therefore, in order to understand and treat these diseases properly, we need to understand the nature of the oxidative stress. However, we have a limited understanding of which molecules are oxidation-sensitive, and which molecules might be products of oxidative stress.
To expand our knowledge about oxidative stress, a project has been initiated in my lab to create a “redoxome” database: a catalog of redox-related molecules, ideally specific to the biological fluid or tissue of interest. The basic idea is to generate a database of “real” oxidation-sensitive molecules and the molecules produced by oxidative stress by taking a normal sample of the biological material one is interested in studying and subjecting it to one or more oxidizing potentials in a chemical oxidation cell (an artificially oxidizing environment).
In the case of this specific experiment, there is a base/control “0mV” fraction and 4 oxidized fractions: “250mV”, “500mV”, “750mV”, and “1000mV”. The fractions were run on our LC/MS system to quantify the abundances of the various molecules present in each sample. The features were then extracted in an “untargeted” way, meaning features are extracted based only on compound mass and retention time on the LC system, without knowing their exact identities.
Generally, it’s a challenge to identify compounds based on mass and retention time alone, especially as mass increases, because the potential chemical space grows: there are more ways to connect the atoms that make up a molecule of that mass. One way to identify compounds is to use a database. Most labs maintain their own database of compounds that they’ve run on their own system (while mass does not vary, retention time varies with the chromatography, as does the separation of various isoforms), or labs will adopt a chromatography system matched to a particular database. Identifying compounds based on mass alone is the next best thing, but it’s never a sure thing.
Unknowns can be subjected to what is known as MS/MS for identification - certain MS instruments can be used to subject samples to two rounds of MS. The first MS step is to select a peak or feature of interest, the second MS step is to fragment the compound. MS/MS is done on each compound of interest one by one to generate a clean fragment spectrum for each compound. The compound structure can then be guessed at because each molecule will break up in particular ways depending on which bonds are the most susceptible to breaking. Even when a compound structure is proposed, a standard is usually bought or synthesized (usually bought, because the standard has to be MS-grade purity with no potential confounding compounds), and the standard is run on the LC/MS system to compare retention time and MS/MS is performed to make sure the fragmentation patterns of the unknown and the standard are a match.
Hopefully it’s becoming clear that identifying a potentially interesting unknown is a lot of work and potentially expensive. Therefore, it’s extremely important to select well!
# packages
require(heatmaply)
require(magrittr)
require(tidyverse)
require(cowplot)
Another note that I should make, before getting too far, is that a mass spec can only detect charged molecules (as in a molecule with one or more total positive or negative charges). An MS instrument cannot measure an uncharged, neutral molecule, and there are a number of ways to 1) give molecules a charge and 2) detect the charged molecule. Also, there are two detection modes:

- negative mode, which detects negatively charged ions
- positive mode, which detects positively charged ions
Some molecules can be picked up in both modes, others will be detected exclusively in one, but not the other. Depending on whether the quantification is relative or absolute (using a standard curve), the values for the same molecule in the two modes may not agree because molecules tend to be better at picking up one type of charge over another.
The set of samples considered in this analysis was analyzed in both negative and positive mode, and both datasets are analyzed here. n._ will denote negative mode data, whereas p._ will denote positive mode data.
# data
n.data <- read_csv(file = "QTOF_precolumn_redox_neg_untarget_nomassRT.csv", col_names = TRUE)
p.data <- read_csv(file = "QTOF_precolumn_redox_pos_untarget_nomassRT.csv", col_names = TRUE)
ncol(n.data)
## [1] 948
ncol(p.data)
## [1] 476
There are 946 Compounds (Features) associated with the negative mode samples (948 columns in n.data, because 1 column holds the sample names and another contains the oxidizing potential measurement). There are 474 Compounds associated with the positive mode data. Important note: even though the same compound may be detected in both positive and negative mode, this is not always the case, and nC1 is not necessarily the same compound as pC1 (and so on for the other numbered compounds).
What the data looks like:
n.data
## # A tibble: 25 x 948
## Samples Potential nC1 nC2 nC3 nC4 nC5
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 n_mouse_0mV_1 0 72636.50 30439.83 26904.16 32375.49 48462.08
## 2 n_mouse_0mV_2 0 65962.64 29955.17 22888.43 34969.71 52285.44
## 3 n_mouse_0mV_3 0 61630.91 26862.67 19755.00 35382.19 44199.13
## 4 n_mouse_0mV_4 0 73057.04 28769.64 20462.82 32864.51 43231.48
## 5 n_mouse_0mV_5 0 70198.90 28441.28 23492.23 34235.84 42638.38
## 6 n_mouse_250mV_1 250 78256.57 31493.36 22720.86 37443.95 60448.74
## 7 n_mouse_250mV_2 250 75873.97 29715.76 26594.17 36876.72 69686.89
## 8 n_mouse_250mV_3 250 79573.65 30031.75 21298.50 36288.14 59219.87
## 9 n_mouse_250mV_4 250 77711.24 32062.44 19113.70 39309.13 56124.11
## 10 n_mouse_250mV_5 250 80901.66 28178.00 24193.07 37332.25 57955.01
## # ... with 15 more rows, and 941 more variables: nC6 <dbl>, nC7 <dbl>,
## # nC8 <dbl>, nC9 <dbl>, nC10 <dbl>, nC11 <dbl>, nC12 <dbl>, nC13 <dbl>,
## # nC14 <dbl>, nC15 <dbl>, nC16 <dbl>, nC17 <dbl>, nC18 <dbl>,
## # nC19 <dbl>, nC20 <dbl>, nC21 <dbl>, nC22 <dbl>, nC23 <dbl>,
## # nC24 <dbl>, nC25 <dbl>, nC26 <dbl>, nC27 <dbl>, nC28 <dbl>,
## # nC29 <dbl>, nC30 <dbl>, nC31 <dbl>, nC32 <dbl>, nC33 <dbl>,
## # nC34 <dbl>, nC35 <dbl>, nC36 <dbl>, nC37 <dbl>, nC38 <dbl>,
## # nC39 <dbl>, nC40 <dbl>, nC41 <dbl>, nC42 <dbl>, nC43 <dbl>,
## # nC44 <dbl>, nC45 <dbl>, nC46 <dbl>, nC47 <dbl>, nC48 <dbl>,
## # nC49 <dbl>, nC50 <dbl>, nC51 <dbl>, nC52 <dbl>, nC53 <dbl>,
## # nC54 <dbl>, nC55 <dbl>, nC56 <dbl>, nC57 <dbl>, nC58 <dbl>,
## # nC59 <dbl>, nC60 <dbl>, nC61 <dbl>, nC62 <dbl>, nC63 <dbl>,
## # nC64 <dbl>, nC65 <dbl>, nC66 <dbl>, nC67 <dbl>, nC68 <dbl>,
## # nC69 <dbl>, nC70 <dbl>, nC71 <dbl>, nC72 <dbl>, nC73 <dbl>,
## # nC74 <dbl>, nC75 <dbl>, nC76 <dbl>, nC77 <dbl>, nC78 <dbl>,
## # nC79 <dbl>, nC80 <dbl>, nC81 <dbl>, nC82 <dbl>, nC83 <dbl>,
## # nC84 <dbl>, nC85 <dbl>, nC86 <dbl>, nC87 <dbl>, nC88 <dbl>,
## # nC89 <dbl>, nC90 <dbl>, nC91 <dbl>, nC92 <dbl>, nC93 <dbl>,
## # nC94 <dbl>, nC95 <dbl>, nC96 <dbl>, nC97 <dbl>, nC98 <dbl>,
## # nC99 <dbl>, nC100 <dbl>, nC101 <dbl>, nC102 <dbl>, nC103 <dbl>,
## # nC104 <dbl>, nC105 <dbl>, ...
summary(n.data$Potential)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 250 500 500 750 1000
There are 5 samples per each oxidizing potential.
n.data$Potential <- factor(n.data$Potential,
levels = c(0, 250, 500, 750, 1000),
labels = c("0", "250", "500", "750", "1000"))
p.data$Potential <- factor(p.data$Potential,
levels = c(0, 250, 500, 750, 1000),
labels = c("0", "250", "500", "750", "1000"))
Metabolomics data tend to have a right-skewed distribution, and it’s customary to log-transform the data. PCA is sensitive to extreme values, so here I will log-transform as well before the PCA analysis.
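As a quick, self-contained illustration of why the log-transform helps (simulated lognormal intensities, not the real dataset):

```r
# Simulated right-skewed "intensities": strongly skewed on the raw scale,
# roughly symmetric after log-transformation.
set.seed(1)
x <- rlnorm(1000, meanlog = 10, sdlog = 0.5)

skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(raw = skewness(x), logged = skewness(log(x)))
```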
n.data.log <- bind_cols(
select(n.data, Samples:Potential),
n.data %>% select(nC1:nC946) %>% log()
)
p.data.log <- bind_cols(
select(p.data, Samples:Potential),
p.data %>% select(pC1:pC474) %>% log()
)
n.data.log %>%
select(nC1:nC946) %>%
as.matrix() %>%
`rownames<-`(n.data.log$Samples) %>%
scale(center = TRUE, scale = TRUE) %>%
heatmap(scale = "none")
p.data.log %>%
select(pC1:pC474) %>%
as.matrix() %>%
`rownames<-`(p.data.log$Samples) %>%
scale(center = TRUE, scale = TRUE) %>%
heatmap(scale = "none")
I normally use mixOmics::cim() to create heatmaps, but it’s not agreeing with me. Trying out heatmap for now, will fix the terrible color scheme later.
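In the meantime, the palette of base heatmap() can be swapped via its col argument; a diverging ramp suits z-scored data centered on 0. A minimal sketch on random data (with the real data, pass the scaled matrix from above):

```r
# Diverging blue-white-red ramp: white near 0, blue below average,
# red above average -- appropriate for z-scored values.
pal <- colorRampPalette(c("blue", "white", "red"))(50)
heatmap(matrix(rnorm(100), nrow = 10), scale = "none", col = pal)
```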
n.pca <- prcomp(n.data.log[, 3:ncol(n.data.log)], center = TRUE, scale = FALSE)
p.pca <- prcomp(p.data.log[, 3:ncol(p.data.log)], center = TRUE, scale = FALSE)
Proportion of variance explained by each principal component for both sets of data:
n.prop.var <- data.frame(
cbind(
1:length(n.pca$sdev),
n.pca$sdev^2 / sum(n.pca$sdev^2),
cumsum(n.pca$sdev^2 / sum(n.pca$sdev^2))
)
)
colnames(n.prop.var) <- c("PC", "Proportion", "Cumulative")
p.prop.var <- data.frame(
cbind(
1:length(p.pca$sdev),
p.pca$sdev^2 / sum(p.pca$sdev^2),
cumsum(p.pca$sdev^2 / sum(p.pca$sdev^2))
)
)
colnames(p.prop.var) <- c("PC", "Proportion", "Cumulative")
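As a cross-check, summary() on a prcomp object returns the same quantities in its importance matrix; a small self-contained example on random data:

```r
# summary.prcomp() bundles the standard deviations, the per-PC proportion
# of variance, and the cumulative proportion into one "importance" matrix.
toy <- prcomp(matrix(rnorm(200), nrow = 20), center = TRUE, scale. = FALSE)
imp <- summary(toy)$importance
imp["Proportion of Variance", 1:3]
imp["Cumulative Proportion", 1:3]
```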
n.prop.var %>%
ggplot(aes(x = PC, y = Proportion)) +
geom_point(size = 3) +
geom_line(size = 1.5) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained") +
ggtitle("Proportion of Variance Explained by Each Component\nNegative Mode")
n.prop.var %>%
ggplot(aes(x = PC, y = Cumulative)) +
geom_point(size = 3) +
geom_line(size = 1.5) +
xlab("Principal Component") +
ylab("Cumulative Proportion of Variance Explained") +
ggtitle("Cumulative Proportion of Variance Explained by Each Additional Component\nNegative Mode")
p.prop.var %>%
ggplot(aes(x = PC, y = Proportion)) +
geom_point(size = 3) +
geom_line(size = 1.5) +
xlab("Principal Component") +
ylab("Proportion of Variance Explained") +
ggtitle("Proportion of Variance Explained by Each Component\nPositive Mode")
p.prop.var %>%
ggplot(aes(x = PC, y = Cumulative)) +
geom_point(size = 3) +
geom_line(size = 1.5) +
xlab("Principal Component") +
ylab("Cumulative Proportion of Variance Explained") +
ggtitle("Cumulative Proportion of Variance Explained by Each Additional Component\nPositive Mode")
For both datasets, the first principal component explains about 77% of the variance. There’s a rapid drop-off: the second component in both cases explains about 13% of variance, and it’s only downhill from there. In both cases, the first 4 PCs are most likely to be useful.
n.components <- n.data %>%
select(Samples:Potential) %>%
bind_cols(data.frame(n.pca$x))
n.components %>%
ggplot(aes(x = PC1, y = PC2, group = Potential, color = Potential)) +
geom_point(size = 3.5) +
ggtitle("PC1 vs PC2 \n Negative Mode") +
xlab("PC1 (77.3% var)") +
ylab("PC2 (12.9% var)") +
ylim(-7.0, 7.0)
n.components %>%
ggplot(aes(x = PC2, y = PC3, color = Potential)) +
geom_point(size = 3.5) +
ggtitle("PC2 vs PC3 \n Negative Mode") +
xlab("PC2 (12.9% var)") +
ylab("PC3 (2.96%)") +
xlim(-7.0, 7.0) +
ylim(-3.5, 3.5)
n.components %>%
ggplot(aes(x = PC3, y = PC4, color = Potential)) +
geom_point(size = 3.5) +
ggtitle("PC3 vs PC4 \n Negative Mode") +
xlab("PC3 (2.96%)") +
ylab("PC4 (1.98%)") +
xlim(-3.5, 3.5) +
ylim(-3.0, 3.0)
It’s clear from the first plot that using PC1 to describe the data leads to the best separation of the groups. Also, it looks like the samples from the different electrical potentials are ordered nicely already, but that can be checked more rigorously with a linear model:
n.comp.num <- as.data.frame(
cbind(
as.numeric(as.character(n.components$Potential)),
n.components$PC1
)
)
colnames(n.comp.num) <- c("Potential", "PC1")
row.names(n.comp.num) <- n.components$Samples
n.PC1.model <- lm(PC1 ~ Potential, data = n.comp.num)
summary(n.PC1.model)
##
## Call:
## lm(formula = PC1 ~ Potential, data = n.comp.num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9643 -1.1519 0.9274 1.2233 2.3228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.240e+01 6.094e-01 -20.34 3.35e-16 ***
## Potential 2.479e-02 9.952e-04 24.91 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.759 on 23 degrees of freedom
## Multiple R-squared: 0.9643, Adjusted R-squared: 0.9627
## F-statistic: 620.5 on 1 and 23 DF, p-value: < 2.2e-16
n.comp.num %>%
ggplot(aes(x = Potential, y = PC1, color = Potential)) +
geom_point(size = 3.5) +
geom_smooth(method = "lm", color = "black") +
ggtitle("PC1 vs Potential (mV)\nNegative Mode") +
xlab("Potential (mV)")
There is a clear strong positive relationship between PC1 and the electrical potential, although it doesn’t look to be perfectly linear.
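That curvature can be checked formally by comparing the linear fit against a fit with a quadratic term (with the real data, refit as PC1 ~ Potential + I(Potential^2) and compare against n.PC1.model via anova()). A self-contained sketch on simulated data:

```r
# Simulated PC1 scores over the same design (5 potentials x 5 replicates),
# with mild curvature added on top of a linear trend.
set.seed(42)
pot <- rep(c(0, 250, 500, 750, 1000), each = 5)
pc1 <- -12 + 0.025 * pot - 1e-5 * (pot - 500)^2 + rnorm(25, sd = 1)

fit.lin  <- lm(pc1 ~ pot)
fit.quad <- lm(pc1 ~ pot + I(pot^2))
anova(fit.lin, fit.quad)  # F-test on whether the quadratic term improves the fit
```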
p.components <- p.data %>%
select(Samples:Potential) %>%
bind_cols(data.frame(p.pca$x))
p.components %>%
ggplot(aes(x = PC1, y = PC2, color = Potential)) +
geom_point(size = 3.5) +
ggtitle("PC1 vs PC2 \n Positive Mode") +
xlab("PC1 (77.1%)") +
ylab("PC2 (12.7%)") +
ylim(-4, 4) +
xlim(-12.5, 12.5)
p.components %>%
ggplot(aes(x = PC2, y = PC3, color = Potential)) +
geom_point(size = 3.5) +
ggtitle("PC2 vs PC3 \n Positive Mode") +
xlab("PC2 (12.7%)") +
ylab("PC3 (2.93%)") +
xlim(-4.5, 4.5) +
ylim(-3, 3)
p.components %>%
ggplot(aes(x = PC3, y = PC4, color = Potential)) +
geom_point(size = 3.5) +
ggtitle("PC3 vs PC4 \n Positive Mode") +
xlab("PC3 (2.93%)") +
ylab("PC4 (1.52%)") +
xlim(-3, 3) +
ylim(-3, 3)
PC1 also has the best relationship with electrochemical oxidative potential for the positive mode data. However, the spread of the data points within the groups, especially in the 0mV group, hints that the data quality is questionable.
p.comp.num <- as.data.frame(
cbind(
as.numeric(as.character(p.components$Potential)),
p.components$PC1
)
)
colnames(p.comp.num) <- c("Potential", "PC1")
row.names(p.comp.num) <- p.components$Samples
p.PC1.model <- lm(PC1 ~ Potential, data = p.comp.num)
summary(p.PC1.model)
##
## Call:
## lm(formula = PC1 ~ Potential, data = p.comp.num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2528 -0.8082 0.0183 1.0149 2.8852
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.1903598 0.5447709 -15.04 2.18e-13 ***
## Potential 0.0163807 0.0008896 18.41 2.91e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.573 on 23 degrees of freedom
## Multiple R-squared: 0.9365, Adjusted R-squared: 0.9337
## F-statistic: 339.1 on 1 and 23 DF, p-value: 2.907e-15
p.comp.num %>%
ggplot(aes(x = Potential, y = PC1, color = Potential)) +
geom_point(size = 3.5) +
geom_smooth(method = "lm", color = "black") +
ggtitle("PC1 vs Potential (mV)\nPositive Mode") +
xlab("Potential (mV)")
Note that the separation that makes the most sense is along PC1 (0mV to 1000mV spread out along the PC1 axis). PC2 may be interesting, but confusing, as there is some separation between groups, but 0mV and 1000mV are nearly at the same position, 250mV and 750mV are very similar, and 500mV is different from all other groups. PC3 may inform how the 0mV and the 250mV groups differ from the other three groups (and from each other).
n.PC1.load <- as_tibble(n.pca$rotation) %>% # as_tibble() replaces the deprecated tbl_df()
mutate(Compounds = row.names(data.frame(n.pca$rotation))) %>%
select(Compounds, PC1) %>%
arrange(PC1)
n.PC1.load %>%
ggplot(aes(sample = PC1)) +
stat_qq() +
ggtitle("QQ Plot\nNegative Mode PC1") +
ylim(-0.4, 0.4)
#take the bottom 2.5% of loadings and the top 2.5% of loadings
n.select.PC1 <- subset(
n.PC1.load,
PC1 >= quantile(n.PC1.load$PC1, 0.975) |
PC1 <= quantile(n.PC1.load$PC1, 0.025)
)
n.select.PC1
## # A tibble: 48 x 2
## Compounds PC1
## <chr> <dbl>
## 1 nC642 -0.3100661
## 2 nC552 -0.2875755
## 3 nC170 -0.2446238
## 4 nC553 -0.2238861
## 5 nC643 -0.2195593
## 6 nC705 -0.2086450
## 7 nC706 -0.1967110
## 8 nC346 -0.1869810
## 9 nC424 -0.1767920
## 10 nC418 -0.1476915
## # ... with 38 more rows
# These compounds are much more manageable to plot in a heatmap figure
n.data.log %>%
select(one_of(n.select.PC1$Compounds)) %>%
as.matrix() %>%
scale(center = TRUE, scale = TRUE) %>%
`rownames<-`(n.data.log$Samples) %>%
heatmaply(scale = "none")
Looks like there’s a good selection of compounds picked up with this analysis, with 2 major groups of compounds and 2 sub-families within each.
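If these groups need to be carried forward as candidate lists, they can be extracted programmatically rather than read off the figure, since heatmap dendrograms come from hclust(dist(...)) by default. A toy sketch (with the real data, cluster t() of the scaled selected-compound matrix so that compounds are the rows):

```r
# Toy matrix standing in for the compounds-by-samples matrix.
set.seed(7)
m <- matrix(rnorm(60), nrow = 12)   # 12 "compounds" x 5 "samples"
hc <- hclust(dist(m))               # complete linkage on Euclidean distances
groups <- cutree(hc, k = 2)         # assign each compound to 1 of 2 major groups
table(groups)
```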
Next, the interesting compounds based on positive mode PC1.
p.PC1.load <- as_tibble(p.pca$rotation) %>% # as_tibble() replaces the deprecated tbl_df()
mutate(Compounds = row.names(data.frame(p.pca$rotation))) %>%
select(Compounds, PC1) %>%
arrange(PC1)
p.PC1.load %>%
ggplot(aes(sample = PC1)) +
stat_qq() +
ggtitle("QQ Plot\nPositive Mode PC1") +
ylim(-0.3, 0.3)
p.select.PC1 <- subset(
p.PC1.load,
PC1 >= quantile(p.PC1.load$PC1, 0.975) |
PC1 <= quantile(p.PC1.load$PC1, 0.025)
)
p.select.PC1
## # A tibble: 24 x 2
## Compounds PC1
## <chr> <dbl>
## 1 pC175 -0.2817509
## 2 pC210 -0.2788191
## 3 pC195 -0.1936996
## 4 pC98 -0.1883896
## 5 pC113 -0.1708991
## 6 pC105 -0.1354455
## 7 pC168 -0.1293160
## 8 pC69 -0.1229430
## 9 pC90 -0.1132040
## 10 pC117 -0.1084555
## # ... with 14 more rows
p.data.log %>%
select(one_of(p.select.PC1$Compounds)) %>%
as.matrix() %>%
scale(center = TRUE, scale = TRUE) %>%
`rownames<-`(p.data.log$Samples) %>%
heatmaply(scale = "none")
The positive mode data, as hinted by the PCA analysis, is more variable than the negative mode data, and the clustered data isn’t as neat. It looks like there are some oxidation-sensitive compounds that have been picked up by the analysis, but no oxidation products.
What to do after this point?